Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts segmentation performance, especially for rare object categories. Although diverse, high-quality object instances yield larger gains with Copy-Paste, previous works take object instances either from human-annotated instance segmentation datasets or from renderings of 3D object models, and both approaches are too expensive to scale up to good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable. To this end, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves gains of +2.6 box AP and +2.1 mask AP on all classes, and even larger gains of +6.8 box AP and +6.5 mask AP on long-tail classes.
Translated by Google Translate
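As an illustration of the basic Copy-Paste operation described in the X-Paste abstract above, the following is a minimal sketch in Python with NumPy. It is not the X-Paste pipeline: the function name `copy_paste`, the boolean-mask representation, and the uniform random placement are assumptions made for this example, and it omits scaling, blending, and the generation or filtering of instances.

```python
import numpy as np


def copy_paste(background, instance, mask, rng):
    """Paste the masked pixels of `instance` onto a copy of `background`.

    background: (H, W, 3) array, the target image.
    instance:   (h, w, 3) array, a crop containing the object.
    mask:       (h, w) boolean array, True where the object is.
    rng:        a numpy.random.Generator used to pick the paste location.

    Returns the augmented image and the full-size (H, W) instance mask,
    which serves as the new segmentation label for the pasted object.
    """
    H, W = background.shape[:2]
    h, w = mask.shape
    if h > H or w > W:
        raise ValueError("instance must fit inside the background")

    # Hypothetical placement policy: uniform over all valid offsets.
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))

    out = background.copy()
    region = out[top:top + h, left:left + w]  # a view into `out`
    region[mask] = instance[mask]             # overwrite object pixels only

    pasted_mask = np.zeros((H, W), dtype=bool)
    pasted_mask[top:top + h, left:left + w] = mask
    return out, pasted_mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bg = np.zeros((8, 8, 3), dtype=np.uint8)
    obj = np.full((3, 3, 3), 255, dtype=np.uint8)
    obj_mask = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=bool)
    image, label = copy_paste(bg, obj, obj_mask, rng)
    print(label.astype(int))
```

Returning the full-size mask alongside the image keeps the pasted object's annotation aligned with the augmented image, which is what lets the pasted instances act as extra training labels.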
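Graph-based models have recently achieved great success in person re-identification. They first compute the graph topology (affinities) among different people and then pass information across them to obtain stronger features. However, we find that existing graph-based methods generalize poorly in the visible-infrared person re-identification task (VI-ReID) because of two issues: 1) The train-test modality balance gap, which is a property of the VI-ReID task. The amounts of data from the two modalities are balanced during training but extremely unbalanced at inference, which leads to poor generalization of graph-based VI-ReID methods. 2) A sub-optimal topology structure caused by training the graph module end to end. Our analysis shows that well-trained input features weaken the learning of the graph topology, so that it does not generalize well enough at inference. In this paper, we propose a Counterfactual Intervention Feature Transfer (CIFT) method to tackle these problems. Specifically, a Homogeneous and Heterogeneous Feature Transfer (H2FT) is designed to reduce the train-test modality balance gap through two independent types of well-designed graph modules and an unbalanced-scenario simulation. In addition, a Counterfactual Relation Intervention (CRI) is proposed that uses counterfactual intervention and causal-effect tools to highlight the role of the topology structure throughout training, which makes the graph topology more reliable. Extensive experiments on standard VI-ReID benchmarks show that CIFT outperforms state-of-the-art methods under various settings.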
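Existing forgery detection methods usually treat face forgery detection as a binary classification problem and adopt deep convolutional neural networks to learn discriminative features. Ideal discriminative features should be related only to the real/fake labels of facial images. However, we observe that the features learned by vanilla classification networks are correlated with unnecessary properties, such as forgery methods and facial identities. This phenomenon limits forgery detection performance, especially generalization ability. Motivated by this, we propose a novel method that uses adversarial learning to eliminate the negative effects of different forgery methods and facial identities, which helps the classification network learn intrinsic, common discriminative features for forgery detection. To exploit data that lack ground-truth labels for facial identity, we design a special identity discriminator based on similarity information derived from an off-the-shelf face recognition model. With the help of adversarial learning, our forgery detection model learns to extract common discriminative features by eliminating the effects of forgery methods and facial identities. Extensive experiments demonstrate the effectiveness of the proposed method under both intra-dataset and cross-dataset evaluation settings.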
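Occlusion between different objects is a typical challenge in multi-object tracking (MOT), and it often leads to inferior tracking results because of missed detections. The common practice in multi-object tracking is to re-identify missed objects after they reappear. Although re-identification can improve tracking performance, it requires identity annotations to train the model. Moreover, re-identification still cannot track highly occluded objects while the detector is missing them. In this paper, we focus on online multi-object tracking and design two novel modules, an unsupervised re-identification learning module and an occlusion estimation module, to handle these problems. Specifically, the proposed unsupervised re-identification learning module requires no (pseudo) identity information and does not suffer from scalability issues. The proposed occlusion estimation module tries to predict the locations where occlusions occur, which are then used to estimate the positions of objects missed by the detector. Our study shows that, when applied to state-of-the-art MOT methods, the proposed unsupervised re-identification learning is comparable to supervised re-identification learning, and that the proposed occlusion estimation module further improves tracking performance.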
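In the real world, the occurrence frequency of objects is naturally skewed and forms a long-tail class distribution, which leads to poor performance on statistically rare classes. A promising solution is to mine tail-class examples to balance the training dataset. However, mining tail-class examples is a very challenging task. For instance, most uncertainty-based mining approaches struggle because bias in the data distorts the class probabilities. In this work, we propose an effective yet simple approach to overcome these challenges. Our framework enhances the subdued tail-class activations and then uses a one-class, data-centric approach to identify tail-class examples effectively. We carry out an exhaustive evaluation of our framework on three datasets spanning two computer vision tasks. Substantial improvements in minority-class mining and in the performance of the fine-tuned model strongly corroborate the value of our proposed solution.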
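Point cloud recognition is an essential task in industrial robotics and autonomous driving. Recently, several point cloud processing models have achieved state-of-the-art performance. However, these methods lack rotation robustness, and their performance degrades severely under random rotations, so they fail to extend to real-world scenarios with varying orientations. To this end, we propose a method named Self Contour-based Transformation (SCT), which can be flexibly integrated into a variety of existing point cloud recognition models to handle arbitrary rotations. SCT provides efficient rotation and translation invariance by introducing the Contour-Aware Transformation (CAT), which linearly transforms the Cartesian coordinates of points into translation- and rotation-invariant representations. We prove, through theoretical analysis, that CAT is a rotation- and translation-invariant transformation. Furthermore, a frame alignment module is proposed to enhance discriminative feature extraction by capturing contours and transforming self contour-based frames into intra-class frames. Extensive experimental results show that SCT outperforms state-of-the-art approaches under arbitrary rotations, in both effectiveness and efficiency, on synthetic and real-world benchmarks. In addition, robustness and generality evaluations indicate that SCT is robust and applicable to various point cloud processing models, which highlights the advantages of SCT in industrial applications.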
A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and come from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. In short, in the era of big data, we face new challenges in causal inference with observational data: extensibility to incrementally available observational data, adaptability to domain adaptation problems beyond the imbalance between treatment and control groups, and accessibility of an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we discuss future research directions on this topic.
Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text, which speeds up the editing process. In previous work, CampNet (context-aware mask prediction network) was proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper addresses a new task: adding emotional effects to the edited speech during text-based speech editing to make the generated speech more expressive. To this end, we propose Emo-CampNet (emotion CampNet), which offers a choice of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of generated speech by introducing additional emotion attributes based on the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered with by the emotional components in the original speech, a neutral content generator is proposed to remove the emotion from the original speech; it is optimized within a generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which enables the model to edit unseen speakers' speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech during text-based speech editing and can edit unseen speakers' speech; and 2) detailed ablation experiments further demonstrate the effectiveness of emotional selectivity and the data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which gives a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model mainly depends on the empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy (M3D) of the substitute models when training the generator to generate adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to model variation, thus improving the success rate of attacks on the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin. Our code will be released.
Event-based simulations of Spiking Neural Networks (SNNs) are fast and accurate. However, they are rarely used in the context of event-based gradient descent because their implementations on GPUs are difficult. Discretization with the forward Euler method is instead often used with gradient descent techniques but has the disadvantage of being computationally expensive. Moreover, the lack of precision of discretized simulations can create mismatches between the simulated models and analog neuromorphic hardware. In this work, we propose a new exact error-backpropagation through spikes method for SNNs, extending Fast & Deep to multiple spikes per neuron. We show that our method can be efficiently implemented on GPUs in a fully event-based manner, making it fast to compute and precise enough for analog neuromorphic hardware. Compared to the original Fast & Deep and the current state-of-the-art event-based gradient-descent algorithms, we demonstrate increased performance on several benchmark datasets with both feedforward and convolutional SNNs. In particular, we show that multi-spike SNNs can have advantages over single-spike networks in terms of convergence, sparsity, classification latency and sensitivity to the dead neuron problem.
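To make the contrast between discretized and event-based simulation in the abstract above concrete, the following is a minimal sketch in Python. It is not the method of the paper: it assumes a single leaky integrate-and-fire neuron with membrane equation dv/dt = (I - v) / tau, a constant input current I, and a fixed firing threshold. The function names and all parameter values are hypothetical. The forward Euler version can only report a spike on its time grid, whereas the event-based version uses the closed-form solution v(t) = I * (1 - exp(-t / tau)) to compute the threshold crossing exactly.

```python
import math


def euler_first_spike(current, tau, threshold, dt, t_max):
    """First spike time of a LIF neuron simulated with forward Euler.

    The membrane potential starts at 0 and is advanced in steps of `dt`;
    the spike time is the first grid point at which v >= threshold.
    Returns None if no spike occurs before `t_max`.
    """
    v = 0.0
    n_steps = int(round(t_max / dt))
    for step in range(1, n_steps + 1):
        v += dt / tau * (current - v)  # forward Euler update
        if v >= threshold:
            return step * dt
    return None


def exact_first_spike(current, tau, threshold):
    """Exact (event-based) first spike time for constant input current.

    Solves I * (1 - exp(-t / tau)) = threshold for t.
    Returns None when the potential can never reach the threshold.
    """
    if current <= threshold:
        return None
    return tau * math.log(current / (current - threshold))


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    current, tau, threshold = 2.0, 10.0, 1.0
    exact = exact_first_spike(current, tau, threshold)
    for dt in (1.0, 0.1):
        approx = euler_first_spike(current, tau, threshold, dt, t_max=100.0)
        print(f"dt={dt}: euler={approx:.3f}, exact={exact:.3f}")
```

The discretized spike time depends on the step size, and reducing the step size to improve precision increases the number of updates per simulated interval, which reflects the cost and precision trade-off mentioned in the abstract.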